In this notebook, I explore a dataset of sneaker orders placed with 100 sneaker shops. I investigate why the average order value (AOV) over a 30-day window is as high as $3145.13, and then determine a metric that represents this dataset more accurately.
I will be answering the following questions: Think about a better way to evaluate this data. What metric would you report for this dataset? What is its value?
local_data_path = "./data/shopify-dataset.csv"
import pandas as pd
# Read dataset into a pandas dataframe
df_data = pd.read_csv(local_data_path)
df_data
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 73 | 993 | 330 | 2 | debit | 2017-03-30 13:47:17 |
| 4996 | 4997 | 48 | 789 | 234 | 2 | cash | 2017-03-16 20:36:16 |
| 4997 | 4998 | 56 | 867 | 351 | 3 | cash | 2017-03-19 5:42:42 |
| 4998 | 4999 | 60 | 825 | 354 | 2 | credit_card | 2017-03-16 14:51:18 |
| 4999 | 5000 | 44 | 734 | 288 | 2 | debit | 2017-03-18 15:48:18 |
5000 rows × 7 columns
from pandas_profiling import ProfileReport
profile_report = ProfileReport(df_data)
profile_report.to_file(output_file='./data/output_pandas_profiling.html')
profile_report
By looking at the profile report and analyzing each variable, I noticed something strange in the order_amount variable: the maximum observed value is 704,000 while the minimum is 90, a huge gap between the two extremes. The total_items variable shows a similarly large spread, with a minimum of 1 and a maximum of 2000 items.
The warnings generated by the profile report pointed to a high correlation between the variables I mentioned. Indeed, looking at the correlation matrices, such as Pearson's, we can clearly see a strong correlation between order_amount and total_items. This is expected, since order_amount is the product of total_items and the price of the shoe, which is not included in this dataset.
Therefore, I decided to explore the order_amount variable in more detail. Because of the strong correlation, this will likely lead to the same conclusions as analyzing the total_items variable.
Before exploring a dataset, it is always a good idea to check for missing values. Even though the profile report already shows that there are none, I wanted to showcase a library I like using to visualize how many missing values are present.
import missingno as mno
fig = mno.matrix(df_data)
fig_copy = fig.get_figure()
fig_copy.savefig('./data/matrix_missing_plot.png', bbox_inches = 'tight')
By looking at the previous diagram, we can see that there are no missing values in the dataset.
Here is an example of a dataset I worked with that had a lot of missing values.
The white lines represent the missing data points. We can clearly see the difference from the current Shopify orders dataset, whose diagram shows no missing data points.
In this section, I decided to visualize the order_amount values in more details.
import matplotlib.pyplot as plt
# Create a figure and axis
fig, ax = plt.subplots()
# Scatter the order_id against the order_amount
ax.scatter(df_data['order_id'], df_data['order_amount'])
# Set a title and labels
ax.set_title('order_amount vs. order_id ')
ax.set_xlabel('order_id')
ax.set_ylabel('order_amount')
As we can see, most of the orders have an order_amount of less than 100,000. We can clearly see one data point around 150,000 and a cluster of data points at around 700,000.
# create a figure and axis
fig, ax = plt.subplots()
ax.hist(df_data["order_amount"])
# set title and labels
ax.set_title('Shopify shoes Dataset')
ax.set_xlabel('Order_amount')
ax.set_ylabel('Count')
df_data.plot(kind='box', figsize=(8, 6))
plt.title('Box Plot of Variables')
plt.ylabel('Value')
plt.show()
import seaborn as sns
sns.boxplot(x=df_data['order_amount'])
df_data[df_data["order_amount"]==704000]
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 1104 | 1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 1362 | 1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 4:00:00 |
| 1436 | 1437 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-11 4:00:00 |
| 1562 | 1563 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-19 4:00:00 |
| 1602 | 1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 4:00:00 |
| 2153 | 2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 4:00:00 |
| 2297 | 2298 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 2835 | 2836 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 2969 | 2970 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 3332 | 3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 4056 | 4057 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 4646 | 4647 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 4868 | 4869 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-22 4:00:00 |
| 4882 | 4883 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-25 4:00:00 |
# x and y given as array_like objects
import plotly.express as px
fig = px.scatter(x=df_data["order_id"], y=df_data["order_amount"])
fig.show()
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id="scatter-plot"),
    html.P("Order Amount:"),
    dcc.RangeSlider(
        id='range-slider',
        min=0, max=800000, step=1,
        marks={0: '0', 800000: '800000'},
        value=[0, 500]
    ),
])

@app.callback(
    Output("scatter-plot", "figure"),
    [Input("range-slider", "value")])
def update_bar_chart(slider_range):
    low, high = slider_range
    mask = (df_data['order_amount'] > low) & (df_data['order_amount'] < high)
    fig = px.scatter(
        df_data[mask], x="order_id", y="order_amount",
        color="order_amount", size='order_amount',
        hover_data=['order_amount'])
    return fig
app.run_server(debug=False)
Dash is running on http://127.0.0.1:8050/
 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Debug mode: off
* Running on http://127.0.0.1:8050/ (Press CTRL+C to quit)
from IPython.display import Video
Video("Dash.mp4", embed=True)
Let's see how many orders have an order amount equal to 704,000.
df_data[df_data["order_amount"]==704000].shape[0]
17
There are 17 orders with this outlier amount. All of them were paid by credit card and placed by the same user (user_id 607) at shop 42, each for 2000 items, and each created at exactly 4:00:00 on various days of March 2017. An order of 2000 items is far too large to be a normal purchase; a regular customer rarely buys more than five pairs of shoes.
Since 17 orders of this size were placed in the same month, this explains the average order value of $3145.13. Indeed, the profile report shows this value as the mean of the order_amount column.
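As a quick sanity check using only the figures above, we can estimate how much these 17 bulk orders alone contribute to the reported AOV:

```python
# 17 bulk orders of $704,000 among 5,000 total orders
total_orders = 5000
n_outliers = 17
outlier_amount = 704_000
reported_aov = 3145.128

# How much the outlier orders add to the overall mean
outlier_contribution = n_outliers * outlier_amount / total_orders
print(outlier_contribution)  # 2393.6

# Mean of the remaining orders, recovered from the reported AOV
remaining_mean = (reported_aov * total_orders
                  - n_outliers * outlier_amount) / (total_orders - n_outliers)
print(round(remaining_mean, 2))  # 754.09
```

So roughly $2393.60 of the $3145.13 AOV comes from these 17 orders alone, which matches the mean of about $754 computed later after removing them.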
The Z-score measures how far an observation lies from the mean of the dataset, in units of the standard deviation: $z = \frac{x - \mu}{\sigma}$. Rescaling and centering the data this way yields a distribution with a mean of 0 and a standard deviation of 1, which makes it easy to detect data points that lie far from zero.
import numpy as np

def retrieve_outliers(column_data, threshold=3):
    # Flag observations whose absolute Z-score exceeds the threshold
    outliers = []
    mean = np.mean(column_data)
    std = np.std(column_data)
    print("Mean: " + str(mean))
    print("Standard deviation: " + str(std))
    for observation in column_data:
        z_score = (observation - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(observation)
    return outliers

outliers_order_amount = retrieve_outliers(df_data["order_amount"])
print("Outliers of the column order_amount are: " + str(outliers_order_amount))
print("There are " + str(len(outliers_order_amount)) + " outliers.")
Mean: 3145.128
Standard deviation: 41278.41088841982
Outliers of the column order_amount are: [704000, 704000, 704000, 154350, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000]
There are 18 outliers.
# The 25th percentile (numeric_only restricts the computation to numeric columns)
Q1 = df_data.quantile(0.25, numeric_only=True)
# The 75th percentile
Q3 = df_data.quantile(0.75, numeric_only=True)
# The interquartile range
IQR = Q3 - Q1
print(IQR)
order_id        2499.5
shop_id           51.0
user_id          150.0
order_amount     227.0
total_items        2.0
dtype: float64
is_outlier_df = (df_data < (Q1 - 1.5 * IQR)) |(df_data > (Q3 + 1.5 * IQR))
is_outlier_df.loc[is_outlier_df['order_amount'] == True]
| | created_at | order_amount | order_id | payment_method | shop_id | total_items | user_id |
|---|---|---|---|---|---|---|---|
| 15 | False | True | False | False | False | True | False |
| 60 | False | True | False | False | False | True | False |
| 99 | False | True | False | False | False | False | False |
| 136 | False | True | False | False | False | False | False |
| 160 | False | True | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4918 | False | True | False | False | False | False | False |
| 4927 | False | True | False | False | False | False | False |
| 4952 | False | True | False | False | False | False | False |
| 4958 | False | True | False | False | False | False | False |
| 4980 | False | True | False | False | False | False | False |
141 rows × 7 columns
def count_outliers(dataframe):
for col in dataframe.columns:
print("Column " + str(col) + " has " + str(dataframe[col].sum()) + " outliers.")
count_outliers(is_outlier_df)
Column created_at has 0 outliers.
Column order_amount has 141 outliers.
Column order_id has 0 outliers.
Column payment_method has 0 outliers.
Column shop_id has 0 outliers.
Column total_items has 18 outliers.
Column user_id has 0 outliers.
We can see that the mean is affected by these outliers.
df_data["order_amount"].mean()
3145.128
df_no_outliers = df_data[df_data["order_amount"] != 704000]
df_no_outliers["order_amount"].mean()
754.0919125025085
Clearly, the mean is not an appropriate metric, since it is highly impacted by the 17 orders of \$704,000. The mean without the outliers is \$754.09, which is less than 25 percent of the value before removing the outliers, \$3145.13.
df_data["order_amount"].mode()
0    153
dtype: int64
df_no_outliers["order_amount"].mode()
0    153
dtype: int64
df_data["order_amount"].median()
284.0
df_no_outliers["order_amount"].median()
284.0
Mean, median and mode are all measures of central tendency, but the mean is the only one affected by the outlier values. As we saw, the mean was more than four times larger when the outliers were included, which was not the case for either the mode or the median.
We can clearly see that the median and the mode are not affected by the outliers: I calculated both metrics before and after removing the outliers, and their values are identical, which is not the case for the mean. The mode is 153 and the median is 284. Since the median is robust to these outliers, the metric I would report for this dataset is the median order value, with a value of \$284.
First, we are interested in the number of orders. Therefore, I am going to use COUNT(), a function that returns the number of rows matching a specified criterion. This function is applied to OrderID, since it is unique and is considered the primary key.
To access the shipper's name, I have to access another table, the Shippers table. I need to find the common information that links the Orders table and the Shippers table. By looking at both tables, I can see that the common column is ShipperID.
By calling the INNER JOIN keyword, I select records that have matching values in both tables.
Finally, I specify the condition using the WHERE clause to filter the records. In this case, I specify the ShipperName as "Speedy Express". Note that I am using aliases to indicate which table each column belongs to, to avoid any ambiguity.
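Putting these steps together, the query might look as follows (a reconstruction, since the original query is not embedded in this export; table and column names follow the w3schools sample database):

```sql
SELECT COUNT(o.OrderID) AS NumberOfOrders
FROM Orders AS o
INNER JOIN Shippers AS s
    ON o.ShipperID = s.ShipperID
WHERE s.ShipperName = 'Speedy Express';
```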
The result of this SQL query indicates that 54 orders were shipped by Speedy Express.
In this case, I would like the last name of the employee, which is a column in the Employees table. But I also need the number of orders per employee, so I also need information from the Orders table. To link these two tables, I call the INNER JOIN keyword on the employee's ID. To merge all the orders with the same employee ID together, I call the GROUP BY statement, which groups rows that have the same values into summary rows; in my case, it groups all the orders handled by the same employee together.
Finally, to have the highest amount of orders, I call the MAX() function on the count of orders total_count and retrieve the LastName associated to this value.
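The query described above can be reconstructed along these lines (the original is not embedded in this export; a subquery computes the per-employee order counts and MAX() picks the largest):

```sql
SELECT LastName, MAX(total_count)
FROM (
    SELECT e.LastName, COUNT(o.OrderID) AS total_count
    FROM Employees AS e
    INNER JOIN Orders AS o
        ON e.EmployeeID = o.EmployeeID
    GROUP BY e.EmployeeID
) AS counts_per_employee;
```

Note that selecting a bare column alongside MAX() is permissive SQL (accepted by MySQL and SQLite); a strictly portable alternative is to sort by total_count in descending order and keep only the first row.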
The result of this SQL query indicates that the last name of the employee with the most orders is Peacock with 40 orders.
In this case, I want the ProductName which is present in the Products table. I also need to know the amount of times this product was ordered. This can be interpreted in two ways.
1. If by "ordered the most" we only count the number of orders containing that product, then the query would be as follows:
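(A reconstruction, as the original query is not embedded in this export; schema per the w3schools sample database:)

```sql
SELECT p.ProductName, COUNT(o.OrderID) AS order_count
FROM Customers AS c
INNER JOIN Orders AS o
    ON c.CustomerID = o.CustomerID
INNER JOIN OrderDetails AS od
    ON o.OrderID = od.OrderID
INNER JOIN Products AS p
    ON od.ProductID = p.ProductID
WHERE c.Country = 'Germany'
GROUP BY p.ProductID
ORDER BY order_count DESC;
```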
The result of this SQL query indicates that the product that was ordered the most by customers in Germany is Gorgonzola Telino. It was ordered 5 times.
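2. If by "ordered the most" we instead take the total quantity of the product across all orders, the query sums the Quantity column rather than counting orders (again a reconstruction, as the original query is not embedded):

```sql
SELECT p.ProductName, SUM(od.Quantity) AS total_quantity
FROM Customers AS c
INNER JOIN Orders AS o
    ON c.CustomerID = o.CustomerID
INNER JOIN OrderDetails AS od
    ON o.OrderID = od.OrderID
INNER JOIN Products AS p
    ON od.ProductID = p.ProductID
WHERE c.Country = 'Germany'
GROUP BY p.ProductID
ORDER BY total_quantity DESC;
```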
In this case, the product ordered the most is the Boston Crab Meat with a total of 160 times.
Both of these queries are very similar. For the first query, I counted the number of orders containing a product, using the COUNT() function. To link the Customers table with the Orders table, I used the INNER JOIN keyword on CustomerID. Then, to obtain information about the product, I needed to link the Orders table to another one, the OrderDetails table, and to join once more to the Products table on ProductID. To keep only the rows for the country Germany, I used the WHERE clause. I ended the query by calling the GROUP BY statement to merge all the rows with the same ProductID.
The second query is the same, but instead of counting orders, it requests the SUM() of the Quantity column.